vignettes/TxRegInfra2.Rmd
TxRegInfra2.RmdTxRegQuery addresses exploration of transcriptional regulatory networks by integrating data on eQTL, digital genomic footprinting (DGF), DnaseI hypersensitivity binding data (DHS), and transcription factor binding site (TFBS) data. Owing to the volume of emerging tissue-specific data, special data modalities are used.
txregnet databaseThe README.md for this package describes how to populate a MongoDB instance with demonstrative data. We focus on the CRAN package mongolite as the interface to this data.
suppressPackageStartupMessages({
library(TxRegInfra2)
library(mongolite)
library(TnT)
library(EnsDb.Hsapiens.v75)
library(BiocParallel)
register(SerialParam())
})
con1 = mongo(url=URL_txregLocal(),
db="txregnet", collection="Lung_allpairs_v7_eQTL")
names(con1)## [1] "aggregate" "count" "disconnect" "distinct" "drop"
## [6] "export" "find" "import" "index" "info"
## [11] "insert" "iterate" "mapreduce" "remove" "rename"
## [16] "replace" "run" "update"
con1$find(limit=1)## gene_id variant_id tss_distance ma_samples ma_count maf
## 1 ENSG00000227232.4 1_860956_C_G_b37 831403 11 11 0.0143979
## pval_nominal slope slope_se qvalue chr snp_pos A1 A2 build
## 1 0.00123033 -0.921918 0.282688 0.08029171 1 860956 C G b37
Our aim is to produce tools based on Bioconductor idioms that answer questions about transcription regulation on the basis of documents stored in a MongoDB database.
There is not much explicit reflectance in the mongolite API. The following is not part of the formal API for the mongo package, but shows that the mongo instance may be queried for information about its origins.
try(parent.env(con1)$orig[c("name", "db", "url")])## $name
## [1] "Lung_allpairs_v7_eQTL"
##
## $db
## [1] "txregnet"
##
## $url
## [1] "mongodb://127.0.0.1:27017"
MongoDB is a schemaless technology. A ‘database’ in MongoDB is a family of named ‘collections’, and collections can be searched using the ‘find’ operation.
We can only use this package on systems where the mongod service is running and accepting connections.
We can get a list of collections in the database as follows.
con1$run('{"listCollections":1}')$cursor$firstBatch[,"name"]## [1] "vjc2" "vjc1"
## [3] "fPlacenta_DS20346_hg19_FP" "fLung_DS14724_hg19_FP"
## [5] "M5946_1_02_tf" "ENCFF001WBZ_hg19_HS"
## [7] "Lung_allpairs_v7_eQTL"
For a single record from a given collection:
mongo(url=URL_txregLocal(), db="txregnet",
collection="Lung_allpairs_v7_eQTL")$find(limit=1)## gene_id variant_id tss_distance ma_samples ma_count maf
## 1 ENSG00000227232.4 1_860956_C_G_b37 831403 11 11 0.0143979
## pval_nominal slope slope_se qvalue chr snp_pos A1 A2 build
## 1 0.00123033 -0.921918 0.282688 0.08029171 1 860956 C G b37
Queries can be composed using JSON. We have a tool to generate queries that employ the mongodb aggregation method. Here we demonstrate this by computing, for each chromosome, the count and minimum values of the footprint statistic on a sample of placental cells.
m1 = mongo(url = URL_txregLocal(), db = "txregnet", collection="fPlacenta_DS20346_hg19_FP")
newagg = makeAggregator( by="chr", vbl="stat", op="$min", opname="min")The JSON layout of this aggregating query is
[
{
"$group": {
"_id": ["$chr"],
"count": {
"$sum": [1]
},
"min": {
"$min": ["$stat"]
}
}
}
]
Invocation returns a data frame:
head(m1$aggregate(newagg))## _id count min
## 1 chr10 72826 0.00870233
## 2 chr4 72074 0.00986576
## 3 chr5 83737 0.00674804
## 4 chr9 66610 0.01753630
## 5 chr17 67314 0.01207180
## 6 chr6 100439 0.01964700
We need to bind the metadata and information about the mongodb. NB: We may want to utilize MultiAssayExperiment.
The following turns a very ad hoc filtering of the collection names into a DataFrame.
cd = TxRegInfra2::basicColData.tiny
head(cd,2)## DataFrame with 2 rows and 3 columns
## base type mid
## <character> <character> <character>
## ENCFF001WBZ_hg19_HS ENCFF001WBZ HS hg19
## fLung_DS14724_hg19_FP fLung FP DS14724_hg19
rme0 = RaggedMongoExpt(con1, colData=cd)
rme1 = rme0[, which(cd$type=="FP")]A key method in development is subsetting the archive by genomic coordinates. This is accomplished with sbov, which is an early implementation of the (planned) subsetByOverlaps generic.
si = GenomeInfoDb::Seqinfo(genome="hg19")["chr17"] # to fix query genome
myg = GRanges("chr17", IRanges(38.07e6,38.09e6), seqinfo=si)
s1 = sbov(rme1, myg, simplify=FALSE)## ..
s1## class: RaggedExperiment
## dim: 111 2
## assays(3): chr id stat
## rownames: NULL
## colnames(2): fLung_DS14724_hg19_FP fPlacenta_DS20346_hg19_FP
## colData names(6): base type ... type mid
#dim(sa <- sparseAssay(s1, 3)) # compact gives segfault
sa = as(s1, "GRangesList")
sa## GRangesList object of length 2:
## $fLung_DS14724_hg19_FP
## GRanges object with 30 ranges and 3 metadata columns:
## seqnames ranges strand | chr id stat
## <Rle> <IRanges> <Rle> | <character> <character> <numeric>
## [1] chr17 38083253-38083260 * | chr17 id-454581 0.761905
## [2] chr17 38083323-38083335 * | chr17 id-454582 0.944444
## [3] chr17 38083535-38083541 * | chr17 id-454583 0.805556
## [4] chr17 38083577-38083590 * | chr17 id-454584 0.943787
## [5] chr17 38083593-38083606 * | chr17 id-454585 0.933258
## ... ... ... ... . ... ... ...
## [26] chr17 38084078-38084088 * | chr17 id-454606 0.829365
## [27] chr17 38084097-38084104 * | chr17 id-454607 0.800000
## [28] chr17 38084110-38084150 * | chr17 id-454608 0.880682
## [29] chr17 38084160-38084169 * | chr17 id-454609 0.533333
## [30] chr17 38084924-38084952 * | chr17 id-454610 0.890476
## -------
## seqinfo: 1 sequence from hg19 genome
##
## $fPlacenta_DS20346_hg19_FP
## GRanges object with 81 ranges and 3 metadata columns:
## seqnames ranges strand | chr id stat
## <Rle> <IRanges> <Rle> | <character> <character> <numeric>
## [1] chr17 38073443-38073472 * | chr17 id-600855 0.657201
## [2] chr17 38074829-38074840 * | chr17 id-600856 0.777273
## [3] chr17 38074866-38074872 * | chr17 id-600857 0.916667
## [4] chr17 38074882-38074908 * | chr17 id-600858 0.817949
## [5] chr17 38074913-38074920 * | chr17 id-600859 0.874060
## ... ... ... ... . ... ... ...
## [77] chr17 38089527-38089535 * | chr17 id-600931 0.894269
## [78] chr17 38089573-38089589 * | chr17 id-600932 0.790179
## [79] chr17 38089599-38089606 * | chr17 id-600933 0.536646
## [80] chr17 38089671-38089677 * | chr17 id-600934 0.582996
## [81] chr17 38089691-38089728 * | chr17 id-600935 0.642579
## -------
## seqinfo: 1 sequence from hg19 genome
ormm = txmodels("ORMDL3", plot=FALSE, name="ORMDL3")## Loading required namespace: biovizBase
## Loading required namespace: Gviz
#sar = strsplit(rownames(sa), ":|-")
dat = unlist(sa)
dat$score = 1-dat$stat
dat = split(dat, names(dat))
dat[[1]]$value = dat[[1]]$score # for TnT
dat[[2]]$value = dat[[2]]$score
d1 = dat[[1]]
width(d1) = 1
d2 = dat[[2]]
width(d2) = 1
names(d1) = seq_len(length(d1)) # for TnT, can't have duplicated rownames
names(d2) = seq_len(length(d2))
pt1 = PinTrack(d1)
pt2 = PinTrack(d2)
data(tnt_genetrack_hg19)
data(tnt_txtrack_hg19)
vr = GRanges("chr17", IRanges(38.05e6, width=50000))
TnTGenome(list(pt1,pt2,tnt_genetrack_hg19,tnt_txtrack_hg19), view.range=vr)